Abstract

Methods that can screen large databases to retrieve a structurally
diverse set of compounds with desirable bioactivity properties are
critical in the drug discovery and development process. This paper
presents a set of such methods, which are designed to find compounds
that are structurally different from a certain query compound while
retaining its bioactivity properties (scaffold hops). These methods
utilize various indirect ways of measuring the similarity between the
query and a compound that take into account additional information
beyond their structure-based similarities. Two sets of techniques are
presented that capture these indirect similarities using approaches
based on automatic relevance feedback and on analyzing the similarity
network formed by the query and the database compounds. Experimental
evaluation shows that many of these methods substantially outperform
previously developed approaches, both in their ability to identify
structurally diverse active compounds and in their ability to identify
active compounds in general.
Keywords:  descriptor-space,  ranked-retrieval,  scaffold-hopping, 
virtual  screening. 

1  Introduction 

Discovery, design, and development of new drugs is an expensive and
challenging process. Any new drug should not only produce the desired
response to the disease but should do so with minimal side effects.
One of the key steps in the drug design process is the identification
of the chemical compounds (hit compounds, or just hits) that display
the desired and reproducible activity against the specific
biomolecular target [23]. This represents a significant hurdle in the
early stages of drug discovery.

A popular approach for finding these hits is to use a compound, known
to possess some of the desired activity properties, as a reference and
identify other compounds from a large compound database that have a
similar structure. This is nothing more than a ranked-retrieval using
the reference compound as a query. This approach relies on the
well-known fact that compounds sharing key structural features will
most likely have similar activity against a biomolecular target. This
is referred to as the structure activity relationship (SAR) [9]. The
similarity between the compounds is usually computed by first
representing their molecular graph as a vector in a particular
descriptor-space and then using a variety of vector-based methods to
compute their similarity [8,9].

However, the task of identifying hit compounds is complicated by the
fact that the query might have undesirable properties such as
toxicity, bad ADME (absorption, distribution, metabolism and
excretion) properties, or may be promiscuous [17,26]. These properties
will also be shared by most of the highest ranked compounds as they
will correspond to very similar structures. In order to overcome this
problem, it is important to rank high as many chemical compounds as
possible that not only show the desired activity for the biomolecular
target but also have different structures (come from diverse chemical
classes or chemotypes). Finding novel chemotypes using information
about already known bioactive small molecules is termed
scaffold-hopping [17,27,32].

In this paper we address the problem of scaffold-hopping by developing
a set of techniques that measure the similarity between the query and
a compound by taking into account additional information beyond their
structure-based similarities. These indirect ways of measuring
similarity enable the retrieval of compounds that are structurally
different from the query but at the same time possess the desired
bioactivity properties. We present two sets of techniques to capture
such indirect similarities. The first set contains techniques that are
based on automatic relevance feedback, whereas the second set derives
the indirect similarities by analyzing the similarity network formed
by the query and the database compounds. Both of these sets of
techniques operate on the descriptor-space representation of the
compounds and are independent of the selected descriptor-space.

We experimentally evaluate the performance of these methods using
three different descriptor-spaces and six different datasets. Our
results show that most of these methods are quite effective in
improving the scaffold-hopping performance over standard
ranked-retrieval. Among them, the methods based on the
similarity-network perform the best and substantially outperform
previously developed scaffold-hopping schemes. Moreover, even though
these methods were created to improve the scaffold-hopping
performance, our results show that many of them are quite effective in
improving the ranked-retrieval performance as well.

The rest of the paper is organized as follows. Section 2 describes the
problems addressed in this paper. Section 3 introduces the definitions
and notations used in this paper. Section 4 introduces the various
descriptor-spaces for this problem. Section 5 describes the methods
developed in this paper. Section 6 gives an overview of the related
work in this field. Section 7 describes the materials used in our
experimental methodology. Section 8 compares and discusses the results
obtained. Finally, Section 9 summarizes the results of this paper.

2  Problem  Statement  and  Motivation

The ranked-retrieval and the scaffold-hopping problems that we
consider in this paper are defined as follows:

Definition 1 (Ranked-Retrieval Problem). Given a query compound, rank
the compounds in the database based on how similar they are to the
query in terms of their bioactivity.

Definition 2 (Scaffold-Hopping Problem). Given a query compound and a
parameter k, retrieve the k compounds that are similar to the query in
terms of their bioactivity but whose structure is as dissimilar as
possible to that of the query.

The solution to the ranked-retrieval problem relies on the well-known
fact that the chemical structure of a compound relates to its activity
(SAR) [9]. As such, effective solutions can be devised that rank the
compounds in the database based on how structurally similar they are
to the query.

However, for scaffold-hopping, the compounds retrieved must be
structurally sufficiently similar to possess similar bioactivity but
at the same time must be structurally dissimilar enough to be a novel
chemotype. This is a much harder problem than simple ranked-retrieval
as it has the additional constraint of maximizing dissimilarity that
runs counter to SAR.

Methods that have the ability to rank higher the compounds that are
structurally different (different chemotypes) have advantages over
methods that do not. They improve the odds of being able to find a
compound that is not only active for a biomolecular target but also
has all the other desired properties (non-toxicity, good ADME
properties, target specificity, etc. [8,17]) that the reference
structure and compounds with similar structures might not possess. One
such compound is then more likely to become a true drug candidate.
Furthermore, scaffold-hopping is also important from the point of view
of un-patented chemical space. Many important lead compounds and drug
candidates have already been patented. In order to find new therapies
and offer alternative treatments, it is important for a pharmaceutical
company to discover novel leads away from the existing patented
chemical space. Methods that perform scaffold-hopping can achieve
those objectives.

3  Definitions  and  Notations 

Throughout  the  paper  we  will  use  D  to  denote  a  database  of 
chemical  compounds,  q  to  denote  a  query  compound,  and  c 
to  denote  a  chemical  compound  present  in  the  database. 

Given two compounds c_i and c_j, we will use sim(c_i, c_j) to denote
their (direct) similarity, which is computed with respect to their
descriptor-space representation by a suitable similarity measure.

Given a compound c_i and a set of compounds A, we will use sim(c_i, A)
to denote the average pairwise similarity between c_i and all the
compounds in A.

Given a query compound q, a database D, and a parameter k, we define
top-k to be the k compounds in D that are most similar to q.

Given a compound c, a set of compounds A, and a similarity measure,
its k-nearest-neighbor list contains the k compounds in A that are
most similar to c.

Finally,  throughout  the  paper  we  will  refer  to  the  task  of 
retrieving  active  compounds  as  ranked-retrieval  and  the  task 
of  retrieving  scaffold-hops  as  scaffold-hopping. 

4  Descriptor  Spaces  for  Ranked-Retrieval 

The similarity between chemical compounds is usually computed by first
transforming them into a suitable descriptor-space representation
[8,9]. A number of different approaches have been developed to
represent each compound by a set of descriptors. These descriptors can
be based on physicochemical properties as well as topological and
geometric substructures (fragments) [1,3,12,18,25,29,31].

In this study we use three descriptor-spaces that have been shown to
be very effective in the context of ranked-retrieval and/or
scaffold-hopping. These descriptor-spaces are the graph fragments (GF)
[29], extended connectivity fingerprints (ECFP) [18,25], and the
extended reduced graph (ErG) descriptors [27].

GF is a 2D topology-based descriptor-space [29] that is based on all
the graph fragments of a molecular graph up to a predefined size. ECFP
is also a 2D topological descriptor-space and many flavors of these
descriptors have been described by several authors [18,25]. The idea
behind this descriptor-space is to capture the topology around each
atom in the form of shells whose radius (number of bonds) ranges from
1 to l, where l is a user-defined parameter. We use the ECZ3 variation
of ECFP, in which each atom is assigned a label corresponding to its
atomic number and the maximum shell radius is set to three. Both
extended connectivity fingerprints (ECFP) and GF have been shown to be
highly effective for the ranked-retrieval of chemical compounds
[18,29].

The extended reduced graph (ErG) descriptors form a pharmacophoric
descriptor-space. A pharmacophore is defined as a critical 3D or 2D
arrangement of molecular fragments forming a necessary but not
sufficient condition for biological activity. The descriptors that
rely only on 2D information are called 2D pharmacophoric descriptors,
whereas descriptors that utilize 3D information are called 3D
pharmacophoric descriptors. ErG is a 2D pharmacophoric
descriptor-space that combines the reduced graphs [14,15] and binding
property pairs [22] to generate a pharmacophoric descriptor-space. A
detailed description of the generation of these pharmacophoric
descriptors can be found in [27].

5  Methods 

In order to improve the scaffold-hopping performance, we developed a
set of techniques that measure the similarity between the query and a
compound by taking into account additional information beyond their
descriptor-space-based representation. These methods are motivated by
the observation that if a query compound q is structurally similar to
a database compound c_i, and c_i is structurally similar to another
database compound c_j, then q and c_j could be considered as being
similar or related even though they may have zero or very low direct
similarity. This indirect way of measuring similarity can enable the
retrieval of compounds that are structurally different from the query
but at the same time, due to associativity, possess the same
bioactivity properties as the query.

We developed two sets of techniques to capture such indirect
similarities that were inspired by research in the fields of
information retrieval and social network analysis. The first set
contains techniques that use various forms of automatic relevance
feedback to identify a set of compounds to be used for creating an
indirect similarity measure, whereas the second set derives the
indirect similarities by analyzing the network formed by a
k-nearest-neighbor graph representation of the query and the database
compounds. Both of these sets of techniques operate on the
descriptor-space representation of the compounds and are independent
of the selected descriptor-space.

5.1  Relevance-Feedback-based  Methods 

5.1.1  Top-k  Weighting

This approach, which is based on the Rocchio [24] scheme for automatic
relevance feedback, first retrieves the top-k compounds for a given
query q and then uses these compounds to derive an indirect similarity
between q and each of the compounds in the database. Specifically, if
A is the initial set of top-k compounds, the new similarity,
sim_A(q, c), between q and a compound c is given by

$$
\mathrm{sim}_A(q, c) = \alpha \, \mathrm{sim}(q, c) + (1 - \alpha) \, \mathrm{sim}(c, A), \tag{1}
$$

where 0 < α < 1 is a user-specified parameter that controls the degree
to which the new similarity is affected by the compounds in A. We will
refer to this method as TopKAvg.

The motivation behind this approach is that for reasonably small
values of k, the set A will contain a relatively large number of
active compounds. Thus, by modifying the similarity between q and a
compound c to also include how similar c is to the compounds in A, we
obtain a similarity measure that is reinforced by the active compounds
in A. This enables the retrieval of active compounds that are similar
to the compounds present in A even if their similarity to the query is
not very high, thus enabling scaffold-hopping.
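The re-ranking of Equation 1 can be illustrated with a minimal Python sketch. The function names (`avg_sim`, `top_k`, `topk_avg_rank`) and the callable `sim` are hypothetical and introduced here only for illustration; the sketch also assumes that a compound belonging to A contributes its self-similarity to sim(c, A), a detail the text does not specify.

```python
def avg_sim(c, group, sim):
    """sim(c, A): average pairwise similarity between c and the members of A."""
    return sum(sim(c, a) for a in group) / len(group)


def top_k(query, database, k, sim):
    """The k database compounds most similar to the query."""
    return sorted(database, key=lambda c: sim(query, c), reverse=True)[:k]


def topk_avg_rank(query, database, k, alpha, sim):
    """Re-rank the database with Equation 1, using the top-k set as feedback."""
    feedback = top_k(query, database, k, sim)

    def score(c):
        # Assumption: if c is itself in the feedback set, sim(c, c) is included.
        return alpha * sim(query, c) + (1 - alpha) * avg_sim(c, feedback, sim)

    return sorted(database, key=score, reverse=True)
```

In a toy setting where a compound has low direct similarity to the query but high similarity to the top-ranked compound, this re-ranking moves it ahead of compounds with slightly higher direct similarity.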

5.1.2  Cluster  Weighting

This method is similar in spirit to TopKAvg, but employs a
clustering-based approach to identify the set of compounds to use for
automatic relevance feedback. We will refer to this scheme as ClustWt;
it consists of the following four steps. First, it finds the top-k
most similar compounds to a query q. Second, it clusters these
compounds into l = k/m sets {S_1, ..., S_l}, each of size m (assuming
that k is a multiple of m). Third, it selects among these sets the set
S* that has the highest similarity to the query. Fourth, it uses
Equation 1 to re-rank all the compounds in the database using S* as
the relevance feedback set (i.e., A = S*).

The clustering is computed using a fixed-capacity heuristic min-cut
partitioning algorithm on the complete weighted graph whose nodes are
the k compounds and the edge-weights are the similarities between them
[20,21]. Consequently, the inter-cluster compound-to-compound
similarities are explicitly minimized, leading to clusters in which
the intra-cluster similarities are implicitly maximized (i.e., each
cluster will end up containing similar compounds).

By using for relevance feedback the set S*, which contains compounds
that are most similar to the query, ClustWt selects the cluster that
will most likely have a large number of active compounds. This is
similar in spirit to the method that TopKAvg uses to select its own
relevance feedback set A. However, since S* contains compounds that
are also very similar to each other, the number of active compounds
that it contains will tend to be higher than that contained in A
(assuming that both A and S* have the same size). In fact, S* has
already incorporated some form of automatic relevance feedback, since
all pairwise similarities between its compounds were taken into
account during the clustering process. The fact that objects that are
relevant to a query tend to cluster together is well-known within the
document retrieval community and is usually referred to as the
clustering hypothesis [16].
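The four steps of ClustWt can be sketched as follows. This is only an illustration under stated assumptions: the greedy grouping in `greedy_fixed_size_clusters` is a simple stand-in and is not the fixed-capacity min-cut partitioner used in the paper, the similarity of a cluster to the query is taken to be the average similarity of its members to q, and all function names are hypothetical.

```python
def greedy_fixed_size_clusters(items, m, sim):
    """Stand-in clustering (NOT min-cut partitioning): seed each cluster with
    the first unassigned item and add its m - 1 most similar unassigned items."""
    remaining = list(items)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        remaining.sort(key=lambda c: sim(seed, c), reverse=True)
        clusters.append([seed] + remaining[:m - 1])
        remaining = remaining[m - 1:]
    return clusters


def clustwt_rank(query, database, k, m, alpha, sim):
    # Step 1: top-k compounds for the query.
    top = sorted(database, key=lambda c: sim(query, c), reverse=True)[:k]
    # Step 2: split them into clusters of size m.
    clusters = greedy_fixed_size_clusters(top, m, sim)
    # Step 3: pick S*, assumed here to be the cluster with the highest
    # average similarity to the query.
    best = max(clusters, key=lambda s: sum(sim(query, c) for c in s) / len(s))

    # Step 4: re-rank the whole database with Equation 1 and A = S*.
    def score(c):
        feedback = sum(sim(c, a) for a in best) / len(best)
        return alpha * sim(query, c) + (1 - alpha) * feedback

    return best, sorted(database, key=score, reverse=True)
```

Swapping in a real partitioner only requires replacing `greedy_fixed_size_clusters`; the selection and re-ranking steps are unchanged.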

5.1.3  Sum-based  Search

The performance of TopKAvg and ClustWt depends on selecting a
reasonable value for the size of the set used to provide automatic
relevance feedback. If that set is too small, it may not incorporate a
sufficiently large number of active compounds and thus lead to limited
(if any) performance improvements, whereas if the set is too large, it
may degrade the performance by incorporating a relatively large number
of inactive compounds. Unfortunately, our initial experiments showed
that the right size of the relevance feedback set is dataset
dependent.

Motivated by this observation, we developed a scheme for automatic
relevance feedback which, instead of using a fixed number of
compounds, builds the feedback set in a progressive fashion.
Specifically, if A is the set of compounds that have been retrieved
thus far, then the compound selected next, c_next, is the one that has
the highest average similarity to the set A ∪ {q}. That is,

$$
c_{next} = \arg\max_{c_i \in D - A} \{ \mathrm{sim}(c_i, A \cup \{q\}) \}. \tag{2}
$$

This compound is added to A and the overall process is repeated until
the desired number of compounds is retrieved or all the compounds in D
have been ranked. Thus, in this scheme, as soon as a compound is
retrieved it is used to expand the set of compounds used to provide
relevance feedback. We will refer to this method as BestSumDescSim.
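A minimal sketch of this progressive scheme is given below. The function name `best_sum_search`, the argument `n` (the number of compounds to retrieve), and the callable `sim` are illustrative assumptions; the sketch starts from an empty set A, so the first compound retrieved is simply the one most similar to q.

```python
def best_sum_search(query, database, n, sim):
    """Greedy retrieval following Equation 2: at each step pick the candidate
    with the highest average similarity to A ∪ {q}, then add it to A."""
    retrieved = []                 # the set A, in retrieval order
    candidates = list(database)    # D - A
    while candidates and len(retrieved) < n:
        context = retrieved + [query]
        nxt = max(
            candidates,
            key=lambda c: sum(sim(c, x) for x in context) / len(context),
        )
        retrieved.append(nxt)
        candidates.remove(nxt)
    return retrieved
```

The order in which compounds enter `retrieved` is the final ranking, so no separate re-ranking pass is needed.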

5.1.4  Max-based  Search

A common characteristic of the three schemes described so far is that
the final ranking of each compound is computed by taking into account
all the similarities between the compound and the compounds in the
relevance feedback set. Since the compounds in the relevance feedback
set will tend to be structurally similar to the query compound (with
ClustWt potentially being an exception), this approach is rather
conservative in its attempt to identify active compounds that are
structurally different from the query (i.e., scaffold-hops).

To overcome this problem, we developed a best-search scheme that is
based on the BestSumDescSim approach but, instead of selecting the
next compound based on its average similarity to A ∪ {q}, selects the
compound that is the most similar to one of the compounds in A ∪ {q}.
That is, the next compound is given by

$$
c_{next} = \arg\max_{c_i \in D - A} \{ \max_{c_j \in A \cup \{q\}} \mathrm{sim}(c_i, c_j) \}. \tag{3}
$$

In this approach, if a compound c_j other than q has the highest
similarity to some compound c_i in the database, c_i is chosen as
c_next and added to A irrespective of its similarity to q. Thus, the
query-to-compound similarity is not necessarily included in every
iteration as in the other schemes, which allows this scheme to
identify compounds that are structurally different from the query. We
will refer to this scheme as BestMaxDescSim.
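The max-based variant differs from the sum-based one only in how a candidate is scored against A ∪ {q}. The sketch below is a hedged illustration of Equation 3; `best_max_search`, `n`, and `sim` are hypothetical names.

```python
def best_max_search(query, database, n, sim):
    """Greedy retrieval following Equation 3: at each step pick the candidate
    whose single best match in A ∪ {q} is the strongest."""
    retrieved = []                 # the set A, in retrieval order
    candidates = list(database)    # D - A
    while candidates and len(retrieved) < n:
        context = retrieved + [query]
        nxt = max(candidates, key=lambda c: max(sim(c, x) for x in context))
        retrieved.append(nxt)
        candidates.remove(nxt)
    return retrieved
```

On a chain of compounds in which each one is similar only to its predecessor, this search follows the chain away from the query, whereas a ranking by direct similarity would stop at the first link.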


5.2  Nearest-Neighbor  Graph-based  Methods 


These  methods,  motivated  by  the  field  of  social  (relational) 
network  analysis,  determine  the  similarity  between  a  pair  of 
compounds  by  taking  into  account  any  other  compounds  that 
are  very  similar  to  either  or  both  of  them.  Thus,  the  similarity 
depends  on  the  structure  of  the  network  formed  by  all  highly 
similar  pairs  of  compounds. 

The network linking the database compounds with each other and with
the query is determined by using a k-nearest-neighbor (NG) and a
k-mutual-nearest-neighbor (MG) graph. Both of these graphs contain a
node for each of the compounds as well as a node for the query.
However, they differ in the set of edges that they contain. In the
k-nearest-neighbor graph there is an edge between a pair of nodes
corresponding to compounds c_i and c_j if c_i is in the
k-nearest-neighbor list of c_j or vice versa. In the
k-mutual-nearest-neighbor graph, an edge exists only when c_i is in
the k-nearest-neighbor list of c_j and c_j is in the
k-nearest-neighbor list of c_i. As a result of these definitions, each
node in NG will be connected to at least k other nodes (assuming that
each compound has a non-zero similarity to at least k other
compounds), whereas in MG, each node will be connected to at most k
other nodes.
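The two edge rules can be made concrete with a short sketch that builds both graphs as adjacency sets. The names `knn_list` and `build_ng_mg` are hypothetical, `nodes` is assumed to contain the query together with the database compounds, and for brevity the sketch does not exclude neighbors with zero similarity.

```python
def knn_list(c, nodes, k, sim):
    """The k nodes (other than c itself) that are most similar to c."""
    others = [x for x in nodes if x != c]
    return set(sorted(others, key=lambda x: sim(c, x), reverse=True)[:k])


def build_ng_mg(nodes, k, sim):
    """Return (NG, MG) as dictionaries mapping each node to its neighbor set."""
    knn = {c: knn_list(c, nodes, k, sim) for c in nodes}
    ng = {c: set() for c in nodes}
    mg = {c: set() for c in nodes}
    for c in nodes:
        for x in knn[c]:
            # NG: an edge exists if either node lists the other.
            ng[c].add(x)
            ng[x].add(c)
            # MG: an edge exists only if both nodes list each other.
            if c in knn[x]:
                mg[c].add(x)
                mg[x].add(c)
    return ng, mg
```

With this construction every node has at least k neighbors in NG and at most k in MG, matching the degree bounds stated above.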

Since the neighbors of each compound in these graphs correspond to
some of its most structurally similar compounds and due to the
relation between structure and activity, each pair of adjacent
compounds will tend to have similar activity. Thus, these graphs can
be considered as the network structures for capturing bioactivity
relations.

A number of different approaches have been developed for determining
the similarity between nodes in social networks that take into account
various topological characteristics of the underlying graphs [13,28].
In our work, we determine the similarity between a pair of nodes as a
function of the intersection of their adjacency lists, which takes
into account all two-edge paths connecting these nodes. Specifically,
the similarity between c_i and c_j with respect to graph G is given by

$$
\mathrm{sim}_G(c_i, c_j) = \frac{|\mathrm{adj}_G(c_i) \cap \mathrm{adj}_G(c_j)|}{|\mathrm{adj}_G(c_i) \cup \mathrm{adj}_G(c_j)|}, \tag{4}
$$

where adj_G(c_i) and adj_G(c_j) are the adjacency lists of c_i and c_j
in G, respectively. This measure assigns a high similarity value to a
pair of compounds if both are very similar to a large set of common
compounds. Since a pair of active compounds will be more similar to
other active compounds than an active-inactive pair, their similarity
according to Equation 4 will be high. Also, since Equation 4 can
potentially assign a high similarity value to a pair of compounds even
if their direct similarity is very low (as long as they have a large
number of common neighbors), it facilitates scaffold-hopping.
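Equation 4 is a ratio of shared to total neighbors and can be sketched in a few lines. The function name `graph_sim` and the adjacency representation (a dictionary of neighbor sets) are assumptions made for illustration, as is the convention of returning 0.0 when both adjacency lists are empty, a case the text does not discuss.

```python
def graph_sim(adj, ci, cj):
    """Equation 4: shared neighbors of ci and cj divided by all their neighbors."""
    union = adj[ci] | adj[cj]
    if not union:
        # Assumed convention: two isolated nodes have similarity 0.
        return 0.0
    return len(adj[ci] & adj[cj]) / len(union)
```

Note that two nodes need not be adjacent to each other to receive a high value; it is enough that they share most of their neighbors.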

For each of the NG and MG graphs we developed two retrieval schemes
that use Equation 4 as the similarity measure in the sum- and
max-based search strategies given in Equations 2 and 3. For example,
in the case of the NG graph and the sum-based search strategy, the
next compound c_next to be retrieved is given by

$$
c_{next} = \arg\max_{c_i \in D - A} \{ \mathrm{sim}_{NG}(c_i, A \cup \{q\}) \}, \tag{5}
$$

where sim_NG(c_i, A ∪ {q}) is the average pairwise similarity between
c_i and the compounds in A ∪ {q} computed using Equation 4 for the NG
graph. The equations for the other schemes are derived in a similar
fashion. We will refer to these four schemes as BestSumNG, BestMaxNG,
BestSumMG, and BestMaxMG, respectively.

6  Related  Work 

Many methods have been proposed for ranked-retrieval and
scaffold-hopping. These can be divided into two groups. The first
contains methods that rely on better designed descriptor-space
representations, whereas the second contains methods that are not
specific to any descriptor-space representation but utilize different
search strategies to improve the overall performance.

Among the first set of methods, 2D descriptors such as path-based
fingerprints [1,4], dictionary-based keys [2,3], and more recently
extended connectivity fingerprints (ECFP) [18] and graph fragments
(GF) [29] have all been successfully applied to the retrieval problem.
Pharmacophore-based descriptors such as ErG [27] have been shown to
outperform simple 2D topology-based descriptors for scaffold-hopping
[27,33]. Lastly, descriptors based on the 3D structure or
conformations of the molecule have also been applied successfully for
scaffold-hopping [26,33].

The second set of methods includes the turbo search schemes
(TurboSumFusion and TurboMaxFusion) [17] and the structural unit
analysis based techniques [32], all of which utilize relevance
feedback [6] ideas. These have been shown to be effective for both
scaffold-hopping and ranked-retrieval. The turbo search techniques
operate as follows. Given a query q, they start by retrieving the
top-k compounds from the database. Let A be the (k + 1)-size set that
contains q and the top-k compounds. For each compound c ∈ A, all the
compounds in the database are ranked in decreasing order based on
their similarity to c, leading to k + 1 ranked lists. These lists are
used to obtain the final similarity of each compound with respect to
the initial query. In particular, in TurboMaxFusion, the similarity
between q and a compound c is equal to the similarity corresponding to
the maximum ranking of c in the k + 1 lists, whereas in
TurboSumFusion, the similarity is equal to the sum of all the
similarities in these rankings. Similar methods based on consensus
scoring, rank averaging, and voting have been investigated in [33].
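The turbo schemes can be sketched as a fusion of similarity scores over the k + 1 reference compounds. This is one possible reading of the description above rather than the reference implementation of [17]: the sketch fuses raw similarity values (with `max` or `sum`) instead of manipulating ranked lists explicitly, and the name `turbo_fusion` is hypothetical.

```python
def turbo_fusion(query, database, k, sim, fuse):
    """Score each compound by fusing its similarities to the query and to the
    query's top-k neighbors. Pass fuse=max or fuse=sum."""
    top = sorted(database, key=lambda c: sim(query, c), reverse=True)[:k]
    refs = [query] + top  # the (k + 1)-size set A
    scores = {c: fuse([sim(r, c) for r in refs]) for c in database}
    ranking = sorted(database, key=lambda c: scores[c], reverse=True)
    return ranking, scores
```

Calling `turbo_fusion(q, D, k, sim, max)` corresponds to the max-fusion variant and `turbo_fusion(q, D, k, sim, sum)` to the sum-fusion variant under this reading.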

The TurboSumFusion approach is similar to that of TopKAvg described in
Section 5.1.1, as it utilizes a relevance feedback mechanism to
re-rank a database with respect to a query. However, the
TurboSumFusion approach treats every compound in the top-k set as
equally important along with the query, whereas in TopKAvg, each
compound in A is given a weight of (1 − α)/(|A| α) relative to q.
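The relative weight quoted above follows from expanding the average in Equation 1. The short derivation below is a sketch that assumes α > 0 and writes x for a generic member of A; the numerical example is illustrative only.

```latex
\begin{align*}
\mathrm{sim}_A(q, c)
  &= \alpha \, \mathrm{sim}(q, c)
     + \frac{1 - \alpha}{|A|} \sum_{x \in A} \mathrm{sim}(c, x) \\
  &= \alpha \left[ \mathrm{sim}(q, c)
     + \frac{1 - \alpha}{|A| \, \alpha} \sum_{x \in A} \mathrm{sim}(c, x) \right].
\end{align*}
% Example: \alpha = 0.5 and |A| = 10 give a relative weight of
% (1 - 0.5) / (10 \cdot 0.5) = 0.1 per feedback compound, whereas a
% sum-based fusion gives each feedback compound a relative weight of 1.
```

Since multiplying all scores by the positive constant α does not change the ranking, each feedback compound effectively contributes with weight (1 − α)/(|A| α) relative to the query.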

7  Materials 

7.1  Datasets 

We used datasets that contain compounds that bind to six different
biomolecular targets: COX2 (cyclooxygenase 2), CDK2 (cyclin-dependent
kinase 2), FXa (coagulation factor Xa), PDE5 (phosphodiesterase 5),
A1A (alpha-1A adrenoceptor), and MAO (monoamine oxidase). Each of
these sets represents a different activity class.

The datasets for the first five targets are obtained from [5,19]. The
entire set consists of 2142 compounds and there are 50 active
compounds for each one of the targets (250 in total). The rest of the
compounds are "decoys" (inactive) obtained from the National Cancer
Institute diversity set. For each target, we constructed a dataset
that contains its 50 active compounds and all the decoys. These
datasets are referred to as COX2, CDK2, PDE5, FXa and A1A.

The dataset of the sixth target was derived from [11,29] and, after
removing compounds with impossible Kekule forms and valence errors, it
contains 1458 compounds. The compounds in this dataset have been
categorized into four different classes, 0, 1, 2, and 3, based on
their levels of activity, with 0 indicating no activity. For our
experiments we treat all the compounds that have a non-zero activity
level (268 compounds) as active.

7.2  Definition  of  Scaffold-Hopping  Compounds

Molecular scaffold is a widely cited concept and is used to evaluate
the performance of a method with respect to its scaffold-hopping
ability. However, the definition of a scaffold-hop is highly
subjective, with numerous papers using different criteria to define
what constitutes a scaffold-hop [10,17,32,33].

In this paper we use an objective way of defining which compounds can
be considered as scaffold-hops by using an approach that directly
relies on the scaffold-hopping problem definition (Section 2). In
particular, for a given query q, the active compounds are ranked based
on their structural similarity to q, and the lowest 50% of them are
defined to be the scaffold-hops for q. Thus, this approach identifies
a set of scaffold-hopping compounds that are specific to each query
and represent the 50% most dissimilar active compounds to the query.
We use the 2048-bit path-based fingerprint generated by Chemaxon's
screen program [4] for measuring the structural similarity between a
query and an active compound. These fingerprints are well-designed to
capture structural similarity between two compounds [27].
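The query-specific definition of scaffold-hops amounts to a sort followed by a cut at the midpoint. The sketch below is illustrative: the name `scaffold_hops` and the callable `structural_sim` are hypothetical, the query is assumed to be excluded from its own list of actives, and an odd number of actives is rounded down, a detail the text leaves open.

```python
def scaffold_hops(query, actives, structural_sim):
    """Return the 50% of the active compounds that are least similar to the query."""
    others = [c for c in actives if c != query]
    # Ascending order: the most structurally dissimilar actives come first.
    ranked = sorted(others, key=lambda c: structural_sim(query, c))
    # Assumption: round down when the number of actives is odd.
    return set(ranked[: len(ranked) // 2])
```

Because the set depends on the query, the same active compound can be a scaffold-hop for one query and not for another.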

7.3  Experimental  Methodology 

All the experiments were performed on dual core AMD Opterons with 4 GB
of memory. We used the descriptor-spaces GF, ECZ3, and ErG (described
in Section 4) for evaluating the methods introduced in this paper.
Each method is tested against six datasets (Section 7.1) using three
different descriptor-spaces (Section 4), leading to a total of 18
different combinations of datasets and descriptor-spaces. We will
refer to them as 18 different problems.

We  use  the  Tanimoto  similarity  [8,  30,  31]  for  all  direct 
similarity  calculations.  The  Tanimoto  similarity  function  is 
given  by 

\[
\mathrm{sim}(c_i, c_j) = \frac{\sum_{k=1}^{n} c_{ik}\, c_{jk}}{\sum_{k=1}^{n} (c_{ik})^2 + \sum_{k=1}^{n} (c_{jk})^2 - \sum_{k=1}^{n} c_{ik}\, c_{jk}}, \qquad (6)
\]

where $c_{ik}$ and $c_{jk}$ are the values for the kth dimension in the
n-dimensional descriptor-space representation for the compounds
$c_i$ and $c_j$, respectively. This similarity function was
selected because it has been shown to be an effective way
of measuring the similarity between chemical compounds
[30, 31] for ranked-retrieval and is the most widely-used
similarity function in cheminformatics.
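
As a minimal sketch, the Tanimoto similarity of Equation 6 can be computed for two descriptor vectors as follows. The function name `tanimoto` and the choice to return 0.0 for two all-zero vectors are assumptions made for illustration.

```python
def tanimoto(ci, cj):
    """Tanimoto similarity between two equal-length descriptor vectors."""
    if len(ci) != len(cj):
        raise ValueError("vectors must have the same dimensionality")
    # Numerator: sum over k of c_ik * c_jk.
    dot = sum(a * b for a, b in zip(ci, cj))
    # Denominator: sum of squares of each vector minus the dot product.
    denom = sum(a * a for a in ci) + sum(b * b for b in cj) - dot
    # Two all-zero vectors give 0/0; returning 0.0 is an assumption.
    return dot / denom if denom else 0.0


# Binary vectors sharing two of four set dimensions.
print(tanimoto([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```

For binary vectors this reduces to the number of shared features divided by the number of features present in either compound; the same expression also accepts frequency-valued descriptors.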

For each dataset we used each of its active compounds as a
query and evaluated the extent to which the various methods
lead to effective retrieval of the other active compounds and
scaffold-hops. For ClustWt we used hMETIS [20,21] to
perform the clustering into fixed-size clusters.

We varied the parameter values for the methods described
in Section 5 and obtained results by averaging over four
different sets of values. For TopKAvg, which depends on the
number of compounds k used in relevance feedback, we used
k = 5, 10, 15, and 20. For ClustWt, which depends on the
cluster size m and the number of compounds k on which the
clustering was performed, we used m = 25 and 40 and k =
200 and 400. These parameter values were selected because
they gave the best results in our experiments. For the
nearest-neighbor methods, which depend on the number of neighbors k,
we used k = 4, 6, 8, and 10 for BestSumNG and BestMaxNG,
and k = 12, 16, 20, and 24 for the BestSumMG
and BestMaxMG schemes. These values were chosen because
they gave good results. Moreover, for NG, values
of k below 4 lead to graphs with many connected components,
whereas for MG the corresponding threshold is 12. Hence, we
decided not to use values below these thresholds. Note that the
threshold for NG is lower than that for MG because the criterion
for an edge to exist between two nodes of the neighborhood graph
is stricter for MG than for NG (Section 5.2).

We also compared our schemes against TurboMaxFusion
and TurboSumFusion [17]. For both of these methods,
we used k = 5, 10, 15, and 20. These values gave the best
results, and the results degraded as k was further increased.

7.4 Standard Retrieval

For each problem, we obtain a baseline performance by ranking
all the compounds with respect to each active compound
using the Tanimoto similarity. We call this Standard Retrieval
and denote it by StdRet.

7.5  Performance  Assessment  Measures 

We measure ranked-retrieval and scaffold-hopping performance
using uninterpolated precision [16]. This is calculated
as follows. For each active that appears in the top 50 retrieved
compounds we compute the precision value. For ranked-retrieval
this is defined as the ratio of the number of actives
retrieved over the number of compounds retrieved thus far.
For scaffold-hopping it is defined as the number of scaffold-hops
retrieved over the number of compounds retrieved thus
far. For both ranked-retrieval and scaffold-hopping we sum
all these precision values and normalize the sum by dividing
it by 50. This is called the total uninterpolated precision
for a query. Similar values are obtained for all the queries of
a dataset, and the total uninterpolated precision for the dataset
is the average of all these values. Note that the total
uninterpolated precision captures the number of active compounds
(scaffold-hops) for each query as well as the position (rank)
information of the actives (scaffold-hops).
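
The per-query calculation can be illustrated with the following sketch. The function name `uninterpolated_precision` and its arguments are hypothetical. The code implements the ranked-retrieval case directly; using the same routine for scaffold-hopping with the set of scaffold-hops as `relevant` is an assumption about how the two cases correspond.

```python
def uninterpolated_precision(ranking, relevant, cutoff=50):
    """Total uninterpolated precision of one ranked list.

    `ranking` lists compound ids from most to least similar to the query,
    and `relevant` is the set of ids counted as hits (the actives for
    ranked-retrieval). Both names are hypothetical.
    """
    hits = 0
    total = 0.0
    for rank, compound in enumerate(ranking[:cutoff], start=1):
        if compound in relevant:
            hits += 1
            # Precision at this position: hits so far / compounds so far.
            total += hits / rank
    # Normalize by the cutoff (50 in the paper).
    return total / cutoff


# Toy list of four compounds with a cutoff of 4 to keep the numbers small.
score = uninterpolated_precision(["a", "x", "b", "y"], {"a", "b"}, cutoff=4)
print(round(score, 4))  # (1/1 + 2/3) / 4
```

A list whose top 50 positions are all hits scores 1.0, and moving a hit further down the list lowers the score, which is how the measure combines hit counts with rank information.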

To compare the ranked-retrieval or scaffold-hopping performance
of two methods, we evaluate their relative performance
over all 18 problems. This is achieved as follows.
Let $r_i$ and $q_i$ represent the ranked-retrieval or
scaffold-hopping performance achieved by methods r and q on the ith
problem, respectively. We calculate the log-ratio, $\log_2(r_i/q_i)$,
for every problem and take the average of these values. We
call this quantity the Average Relative Performance, or ARP,
of r with respect to q. On average, if the ARP is less
than zero, r performs worse than q, whereas if the ARP is
greater than zero, r performs better than q. Note that we
use log-ratios rather than the ratios themselves because
the distribution of the ratio of two random variables is
not symmetric, whereas their log-ratios are normally
distributed. This allows us to compute their average
and compare them in an unbiased way. We also assess
whether the ARP for a given pair of methods is statistically
significant using Student’s t-test [7], which is well-suited
to assessing the statistical significance of a sample of values
drawn from a normal distribution.
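
A small sketch of the ARP computation follows. The function names are hypothetical, and the text does not state which variant of the t-test was used, so the one-sample t statistic of the log-ratios against a mean of zero shown here is an assumption. The p-value is omitted because it needs the t-distribution; a routine such as `scipy.stats.ttest_1samp` is one way to obtain it.

```python
import math
import statistics


def log_ratios(r, q):
    """Per-problem log2 ratios of method r's scores to method q's scores."""
    return [math.log2(ri / qi) for ri, qi in zip(r, q)]


def average_relative_performance(r, q):
    """ARP of r with respect to q: the mean of the log2 ratios."""
    return statistics.mean(log_ratios(r, q))


def t_statistic(values):
    """One-sample t statistic against a mean of zero (an assumed test)."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


# Made-up precision scores of two methods on three problems.
r_scores = [0.4, 0.2, 0.8]
q_scores = [0.2, 0.2, 0.4]
arp = average_relative_performance(r_scores, q_scores)
print(arp > 0)  # True: r is better than q on average
```

Because the ARP is an average of base-2 logarithms, a value of 1 corresponds to r scoring twice as high as q in the geometric-mean sense, and a value of -1 to half as high.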

8  Results 

8.1  Overall  Performance  Assessment 

Tables 1 and 2 compare the performance of all the methods in
a pairwise fashion for scaffold-hopping and ranked-retrieval,
respectively. In each of these tables we present two statistics.
The first is the ARP of the row method (r) with respect to the
column method (q) as described in Section 7.5. The second
statistic, shown immediately below the ARP value in parentheses,
is its p-value obtained from Student’s t-test. Note
that for the remainder of this section we consider the ARP
of two methods to be statistically significant if p < 0.01.


The rest of this section highlights some of the key observations
that can be made by analyzing the results in these tables.

8.1.1 Performance of Relevance Feedback Methods

Comparing the performance of the four relevance-feedback-based
methods described in Section 5.1 against
StdRet, we see that all of them lead to better scaffold-hopping
results. Among them, the results achieved by
ClustWt and BestSumDescSim are 63% and 94% better
than StdRet, respectively, and these improvements are
statistically significant. However, all four of these methods
achieve somewhat worse ranked-retrieval performance (3%
to 15%). Moreover, these differences are statistically significant
for BestSumDescSim and BestMaxDescSim.

Comparing the four methods against TurboSumFusion
and TurboMaxFusion, we observe that the relative performance
of most of these methods varies, with some methods
doing better for scaffold-hopping and others doing better for
ranked-retrieval. However, with the exception of TopKAvg,
which is statistically better than the two fusion-based schemes
for ranked-retrieval, all other differences are not statistically
significant.

Comparing the four relevance-feedback-based methods
against each other, we see that most of them perform the same
for both scaffold-hopping and ranked-retrieval, and whatever
differences exist are not statistically significant. Despite
this, the average performance of BestSumDescSim
is better than that of BestMaxDescSim, indicating that the
sum-based search strategy leads to better results. The results
also show that ClustWt is better than TopKAvg for
scaffold-hopping and that this difference is statistically
significant.

8.1.2 Performance of Nearest-Neighbor Graph-Based Methods

Comparing the performance of the
nearest-neighbor methods, we observe that all of these
schemes show good performance for scaffold-hopping as
well as ranked-retrieval. Among them, the best performing
method is BestSumNG. It achieves the best balance
between the ranked-retrieval and scaffold-hopping performance.
Furthermore, similar to the relevance-feedback-based
methods, the sum-based search methods outperform the
corresponding max-based methods. However, these differences
are not statistically significant.

The results also show that the nearest-neighbor methods
perform substantially better than all the other methods for
scaffold-hopping, and most of these differences are statistically
significant (BestSumDescSim and BestMaxDescSim
are the two exceptions). In particular, the performance
of the nearest-neighbor methods is 62% to 300% better than
that of StdRet and the fusion-based methods and 46% to 244%
better than that of the relevance-feedback-based methods.

The nearest-neighbor methods also achieve better performance
than all of the other methods for ranked-retrieval, although
most of these differences are not statistically significant.
BestSumNG is a clear exception, as its ranked-retrieval
performance is also substantially better than that of all
the non-graph-based techniques, and these differences are
statistically significant. For example, compared to the
fusion-based techniques its ranked-retrieval performance
is 62% to 209% better.

8.2 Performance of Descriptor-Spaces and Datasets

Our discussion so far focused on evaluating the average
performance of the different methods across the various
descriptor-space representations and datasets. In this section
we analyze the performance of the methods on the individual
descriptor-spaces and datasets. We limit our evaluation
to only the ClustWt and the BestSumNG methods,
as these methods achieve the best scaffold-hopping and
ranked-retrieval performance among the relevance-feedback-
and graph-based methods, respectively.

The results of these evaluations are shown in Figures
1 and 2, which compare the performance of StdRet
against ClustWt and BestSumNG, respectively. In these
figures, the left Y-axis represents uninterpolated precision
values for ranked-retrieval, whereas the right Y-axis represents
uninterpolated precision values for scaffold-hopping.
For ClustWt and BestSumNG we also show error bars
that correspond to the standard deviation of the results obtained
for the four sets of parameter values used for these
schemes.

These results show that for scaffold-hopping, ClustWt
outperforms StdRet in most dataset and descriptor-space
combinations. However, the actual performance gains
are dataset and descriptor-space dependent. For example,
ClustWt achieves significant gains on the A1A and FXa
datasets for the ErG and ECZ3 descriptor-spaces, whereas
the gains for the other datasets and/or descriptor-spaces are
not as dramatic. In terms of ranked-retrieval performance,
these results show that in the case of the GF descriptor-space,
ClustWt performs consistently better than StdRet
across all datasets. However, ClustWt’s ranked-retrieval
performance for the other two descriptor-spaces is somewhat
mixed.

Finally, the results in Figure 2 show that for scaffold-hopping,
BestSumNG performs consistently better than
StdRet for all the descriptor-space and dataset combinations.
However, similarly to ClustWt, the actual gains
are dataset and descriptor-space dependent. For example,
the gains are particularly high for the FXa, A1A, and COX2
datasets and for the ErG descriptor space. Similar trends
can be observed with the ranked-retrieval results, with
BestSumNG outperforming StdRet. Moreover, the performance
gains achieved on some problems by BestSumNG
are usually much higher than the performance degradations
in others.

9  Conclusion 

In this paper we introduced a number of methods based on
relevance feedback and social (relational) network analysis
to improve scaffold-hopping and ranked-retrieval. Our results
showed that among these methods, the ones based on
social network analysis consistently and substantially outperform
the standard retrieval as well as previously introduced
methods for these problems.

10  Acknowledgement 

This work was supported by NSF EIA-9986042, ACI-0133464,
IIS-0431135, NIH RLM008713A, the Army High Performance
Computing Research Center contract number DAAD19-01-2-0014,
and by the Digital Technology Center at the University of Minnesota.

References 

[1]  http://www.daylight.com.  Daylight  Inc. 

[2]  http://www.digitalchemistry.co.uk/.  Digital  Chemistry 
Inc. 

[3]  http://www.mdl.com.  MDL  Information  Systems  Inc. 

[4]  www.chemaxon.com.  ChemAxon  Inc. 

[5]  www.cheminformatics.org.  Cheminformatics. 

[6] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
information retrieval. Addison Wesley, 1999.

[7] J. M. Bland. An introduction to medical statistics.
2nd edn. Oxford University Press, 1995.

[8] H.J. Bohm and G. Schneider. Virtual screening for
bioactive molecules. Wiley-VCH, 2000.

[9] Gianpaolo Bravi, Emanuela Gancia, Darren Green, V.S.
Hann, and M. Mike. Modelling structure-activity
relationship. Virtual Screening for Bioactive Molecules,
2000.

[10] N. Brown and E. Jacoby. On scaffolds and hopping in
medicinal chemistry. Mini Rev Medicinal Chemistry,
6(11):1217-1229, 2006.

[11] R. Brown and Y. Martin. Use of structure-activity data
to compare structure-based clustering methods and
descriptors for use in compound selection. J. Chem. Info.
Model., 36(1):576-584, 1996.

[12] Mukund Deshpande, Michihiro Kuramochi, Nikil Wale,
and George Karypis. Frequent substructure-based
approaches for classifying chemical compounds. IEEE
TKDE, 17(8):1036-1050, 2005.

[13] F. Fouss, A. Pirotte, J. Renders, and M. Saerens.
Random walk computation of similarities between nodes of
a graph with application to collaborative filtering. IEEE
TKDE, 19(3):355-369, 2007.

[14] V. J. Gillet, P. Willet, and J. Bradshaw. Similarity
searching using reduced graphs. J. Chem. Inf. Comput.
Sci., 43:338-345, 2003.

[15] G. Harper, G.S. Bravi, S.D. Pickett, J. Hussain, and
D.V. Green. The reduced graph descriptor in virtual
screening and data-driven clustering of high-throughput
screening data. J. Chem. Info. Model., 44(6):45-56,
2004.

[16] Marti Hearst and Jan Pedersen. Reexamining the
cluster hypothesis: Scatter/gather on retrieval results.
ACM/SIGIR, 1996.

[17] J. Hert, P. Willet, and D. Wilton. New methods for
ligand based virtual screening: Use of data fusion and
machine learning to enhance the effectiveness of
similarity searching. J. Chem. Info. Model., (46):462-470,
2006.

[18] J. Hert, P. Willet, D. Wilton, P. Acklin, K. Azzaoui,
E. Jacoby, and A. Schuffenhauer. Comparison of
topological descriptors for similarity-based virtual screening
using multiple bioactive reference structures. Organic
and Biomolecular Chemistry, 2:3256-3266, 2004.

[19] Robert N. Jorissen and Michael K. Gibson. Virtual
screening of molecular databases using support vector
machines. J. Chem. Info. Model., 45(3):549-561, 2005.




Table 1: Performance for Scaffold-Hopping.

Table 2: Performance for Ranked-Retrieval.

The top entry in each cell corresponds to the average of the log2 ratios of the uninterpolated precision of the row method to the column method for the
18 problems. The number below this entry, in parentheses, corresponds to the p-value obtained from Student’s t-test for that entry.


Figure 1: StdRet versus ClustWt.

Figure 2: StdRet versus BestSumNG.